-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27528][SQL] Use Parquet logical type TIMESTAMP_MICROS by default #24425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #104775 has finished for PR 24425 at commit
|
|
Test build #104776 has finished for PR 24425 at commit
|
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we document this? persistence data format changes are potentially fairly impactful
|
Test build #104782 has finished for PR 24425 at commit
|
HyukjinKwon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good given that INT96 is deprecated in Parquet too, and apparently migrating looks fine 777b797
|
Test build #104797 has finished for PR 24425 at commit
|
|
Test build #104800 has finished for PR 24425 at commit
|
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM.
|
Merged to master. |
The upstream default is INT96 but INT96 is considered deprecated by Parquet [1] and we rely internally on the default being INT64 (TIMESTAMP_MICROS). INT64 reduces the size of Parquet files and avoids unnecessary conversions of microseconds to nanoseconds, see [2]. Apache went down the same route in [2] but then reverted to remain compatible with Hive and Presto in [3]. [1] https://issues.apache.org/jira/browse/PARQUET-323 [2] apache#24425 [3] apache#28450
What changes were proposed in this pull request?
In the PR, I propose to use the
TIMESTAMP_MICROSlogical type for timestamps written to parquet files. The type matches semantically to Catalyst'sTimestampType, and stores microseconds since epoch in UTC time zone. This will allow to avoid conversions of microseconds to nanoseconds and to Julian calendar. Also this will reduce sizes of written parquet files.How was this patch tested?
By existing test suites.